The IKEA store, known for its stylish design and affordable price for young people, has become one of the most popular and recognized furniture retailers in the global marketplace. The company has expanded its global reach very quickly. According to Wikipedia, the first IKEA store in Saudi Arabia opened in 1983 and three stores are opening in this region now. In this study, it is of interest to explore which properties of furniture influence whether they cost more than 1000 Saudi Riyals. We use data from IKEA (Saudi Arabia), including 500 observations and the variables of category, price, sellable_online, other_colors, depth, height and width. In particular, this report further explores the impact level of depth, height and width on the price.
| price | sellable_online | other_colors | depth | height | width | price_level | |
|---|---|---|---|---|---|---|---|
| Min. : 3.0 | Mode :logical | Length:500 | Min. : 1.00 | Min. : 3.0 | Min. : 2.0 | Min. :0.000 | |
| 1st Qu.: 168.8 | FALSE:1 | Class :character | 1st Qu.: 37.00 | 1st Qu.: 68.0 | 1st Qu.: 56.0 | 1st Qu.:0.000 | |
| Median : 457.0 | TRUE :499 | Mode :character | Median : 46.00 | Median : 83.0 | Median : 80.0 | Median :0.000 | |
| Mean : 991.1 | NA | NA | Mean : 53.34 | Mean :102.3 | Mean :101.1 | Mean :0.292 | |
| 3rd Qu.:1245.0 | NA | NA | 3rd Qu.: 60.00 | 3rd Qu.:123.8 | 3rd Qu.:134.2 | 3rd Qu.:1.000 | |
| Max. :8551.0 | NA | NA | Max. :252.00 | Max. :251.0 | Max. :367.0 | Max. :1.000 | |
| NA | NA | NA | NA’s :191 | NA’s :146 | NA’s :80 | NA |
We first took price of 1000 as a separating factor according to the problem, and added a new list of binary variables named price_level. Furniture with a price greater than 1000 takes on price level 1, otherwise it takes value 0. Then we performed descriptive statistical analysis based on these selected variables of 500 observations.
Table 1 shows that the price of the furniture ranges from 3.0 to 8551.0, with the middle 50% falling between 168.8 and 1245.0 and an average price of 991.1. We can also observe that the middle 50% of depth is between 37.0 and 60.0, with depth of 53.3 on average. Similarly, the middle 50% of height lies between 68.0 and 123.8, with an average value of 102.3. It also shows the mean value of width is 101.1 and the middle 50% falls between 56.0 and 134.2. Besides, we may look at the sellable_online variable is a logical variable, with 499 TRUE values demonstrating that these quantities of items are available to purchase online and 1 FALSE value demonstrating that 1 item is unavailable. In terms of other_colors, it shows that this is a character variable.
Distribution of price levels.
Figure 1 shows the number of furniture based on categories with a price_level 1 (greater than 1000) and 0 (less than 1000). It is obvious that the number of furniture whose price is greater than 1000 is far less than that of furniture whose price is less than 1000 and the prices of furniture in certain categories are all under 1000, such as Cffr, Cod&du, Ch’f and Nrsf. Additionally, the largest difference is found in the quantity of B&su costing more than 1000 and less than 1000.
Distribution of Furniture Dimensions.
Figure 2 is a bubble chart, each color represents a type of furniture, and each circle of different sizes represents the height and width of each piece of furniture. It is very intuitive and clear to show that the height and width of different furniture are very different, and the same type of furniture also has many different heights and widths. For example, there are many choices for the types of bookshelves, and their height and width are scattered in various places on the coordinate axis. However, the height and width of the chair are relatively concentrated, and most of them are located in the lower left corner of the coordinate axis, which means that the width and height are relatively small, which is also in line with reality.
Missing original data.
Through the above figure, we found anomalies in the data. There were many missing values and the missing data is mainly concentrated in three explanatory variables, namely depth, length and width. And the three horizontal red squares indicate that these three data are missing at the same time. If we ignore or delete these missing data directly, it will have a great impact on the analysis of the data. So we have chosen to use multiple imputation to fill in missing data.
iter imp variable
1 1 depth height width
1 2 depth height width
1 3 depth height width
1 4 depth height width
1 5 depth height width
2 1 depth height width
2 2 depth height width
2 3 depth height width
2 4 depth height width
2 5 depth height width
3 1 depth height width
3 2 depth height width
3 3 depth height width
3 4 depth height width
3 5 depth height width
4 1 depth height width
4 2 depth height width
4 3 depth height width
4 4 depth height width
4 5 depth height width
5 1 depth height width
5 2 depth height width
5 3 depth height width
5 4 depth height width
5 5 depth height width
Data situation of multiple imputation method.
iter imp variable
1 1 depth height width
1 2 depth height width
1 3 depth height width
1 4 depth height width
1 5 depth height width
2 1 depth height width
2 2 depth height width
2 3 depth height width
2 4 depth height width
2 5 depth height width
3 1 depth height width
3 2 depth height width
3 3 depth height width
3 4 depth height width
3 5 depth height width
4 1 depth height width
4 2 depth height width
4 3 depth height width
4 4 depth height width
4 5 depth height width
5 1 depth height width
5 2 depth height width
5 3 depth height width
5 4 depth height width
5 5 depth height width
According to the picture, we can view the data interpolation. The blue point is the original data, and the red point is the interpolation data, which have been added in place of the missing values. We can see that the two color points are relatively overlapped, indicating that the interpolation is very good.Then we chose the fourth database of multiple imputation for generalized linear model analysis.
Call:
glm(formula = price_level ~ sellable_online + other_colors +
depth + height + width, family = binomial(link = "logit"),
data = ikea)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8623 -0.4960 -0.2712 0.1969 2.5461
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.424137 882.743717 0.007 0.994
sellable_onlineTRUE -13.870206 882.743490 -0.016 0.987
other_colorsYes 0.231993 0.290954 0.797 0.425
depth 0.047421 0.007090 6.689 2.25e-11 ***
height 0.012512 0.002470 5.066 4.07e-07 ***
width 0.022326 0.002864 7.796 6.40e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 603.93 on 499 degrees of freedom
Residual deviance: 323.35 on 494 degrees of freedom
AIC: 335.35
Number of Fisher Scoring iterations: 13
We use price_level as the response variable. Because it is a binary variable, so we can use a logistic regression model for the probability of whether the price is greater than 1000. Through the above table, we found that the P values of the two categorical variables(sellable_online and other_colors) are both greater than 0.05, so it means that these two items are not significant in this model, and we need to eliminate these two variables. Next, we use the remaining variables to fit a new general linear model.
\[log({\widehat{\mbox{p}}_{\mbox{i}}\over{1-\widehat{\mbox{p}}_{\mbox{i}}}} )= \widehat{\alpha}+{\widehat\beta}*{\mbox{depth}}_{\mbox{i}}+\widehat{\gamma}*{\mbox{height}}_{\mbox{i}}+\widehat{\delta}*{\mbox{width}}_{\mbox{i}}\]
where
• the \(\widehat{\mbox{p}}_{\mbox{i}}\): the probability of whether the price is greater than 1000 for the \(i\mbox{th}\) furniture.
• the \(\widehat{\alpha}\): the intercept of the regression line.
• the \(\widehat{\beta}\): the coefficient for the first explanatory variable \({\mbox{depth}}\).
• the \(\widehat{\gamma}\): the coefficient for the second explanatory variable \({\mbox{height}}\).
• the \(\widehat{\delta}\): the coefficient for the second explanatory variable \({\mbox{width}}\).
When this model is fitted to the data, the following estimates of \({\alpha}\) (intercept) and \({\beta}\),\({\gamma}\) and \({\delta}\) are returned:
Call:
glm(formula = price_level ~ depth + height + width, family = binomial(link = "logit"),
data = ikea)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8754 -0.4854 -0.2761 0.2018 2.5107
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.342133 0.654665 -11.215 < 2e-16 ***
depth 0.046946 0.007027 6.681 2.37e-11 ***
height 0.012162 0.002447 4.969 6.73e-07 ***
width 0.022909 0.002833 8.086 6.19e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 603.93 on 499 degrees of freedom
Residual deviance: 324.69 on 496 degrees of freedom
AIC: 332.69
Number of Fisher Scoring iterations: 6
According to the coefficients in the above table, we can get the final model as follows: \[log({\widehat{\mbox{p}}_{\mbox{i}}\over{1-\widehat{\mbox{p}}_{\mbox{i}}}} )=-7.3421+0.0469*{\mbox{depth}}_{\mbox{i}}+0.0122*{\mbox{height}}_{\mbox{i}}+0.0229*{\mbox{width}}_{\mbox{i}}\]
This is equivalent to: \[\widehat{\mbox{p}}_{\mbox{i}}={exp(-7.3421+0.0469*{\mbox{depth}}_{\mbox{i}}+0.0122*{\mbox{height}}_{\mbox{i}}+0.0229*{\mbox{width}}_{\mbox{i}})\over1+exp(-7.3421+0.0469*{\mbox{depth}}_{\mbox{i}}+0.0122*{\mbox{height}}_{\mbox{i}}+0.0229*{\mbox{width}}_{\mbox{i}})}\]
The log-odds of the price of a furniture being greater than 1000 increase by 0.05 for every 1 unit increase in depth when keeping the height and width constant. Similarly, the log-odds increase by 0.01 for every 1 unit increase in height and increases by 0.02 for every unit increase in width.
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | -8.7113103 | -6.1375780 |
| depth | 0.0339458 | 0.0614882 |
| height | 0.0074336 | 0.0170563 |
| width | 0.0176286 | 0.0287658 |
Table 2 shows the 95% confidence interval for these log-odds, with intercept being of (-8.71, -6.61), depth being of (0.03, 0.06), height being of (0.007, 0.017) and width being of (0.02, 0.03).All intervals do not contain 0, which once again confirms that these three variables are significant in the model.
| x | |
|---|---|
| (Intercept) | 0.0006477 |
| depth | 1.0480658 |
| height | 1.0122360 |
| width | 1.0231734 |
On the odds scale, the intercept value (0.00006477) gives the probability that the price is greater than 1000 when depth= 0, width=0 and height=0. This is obviously not the feasible range of depth, width and height, so why this value is very close to zero. For depth, there is a probability of 1.05, which means that for each increase in depth by 1 unit, the probability that the furniture price is greater than 1000 increases by 1.06 times. For each unit of the same height, the probability increases by 1.01 times. For each unit increase in width, the probability increases by 1.02 times.
Odds ratios of three explanatory variables.
We can also see the graphical interpretation of odds ratios of the depth, height, and width which interprets that the per unit of depth, weight, and height is increased by 1.06, 1.01, and 1.02 times respectively. Also, the increase in the probability that price greater than 1000 riyals for per unit of depth is highest than per unit of height and per unit of width and for height it is lowest.
Probability of price over 1000 by three different variables.
The above graphs helps to understand more clearly how the probability of a price exceeding 1000 varies with a single parameter: depth, height, and weight. When the depth exceeds 200, the probability that the price is greater than 1000 is the highest and gradually stabilizes, and similarly, when the width exceeds 300, the probability that the price exceeds 1000 is constant. At the same time, the relationship between the probability of the price exceeding 1000 and the height is approximately linear and positively correlated.
ROC curve.
The Receiver Operator Characteristic (ROC) curve is used for the representation of the binary classification of furniture which takes 1 for a price greater than 1000 riyals otherwise it takes 0. It is a probability curve that plots the TPR against FPR at various threshold values. The Area Under the Curve (AUC) is the measure of separability. As the AUC is 0.91924, we could detect more true positive and true negative values than false negative and false positive values. As AUC is high (close to 1) the performance of the model is also high which results in the high ability of the model to distinguish between positive and negative classes.
In this analysis, we have analyzed the dimensions of furniture which affect the price of the furniture to check whether it costs more than 1000 Saudi Riyals. To check that we set the dividing point and distributed the furniture according to the price. This binary distribution divided furniture costing more than 1000 into ‘1’ and others into ‘0’.Since the original data has many missing values, we used multiple imputation method to fill in the data, and use the filled data for analysis and modeling. According to whether the variables were significantly selected in the model, the three variables of width, depth and height were selected to create the final model and test the goodness of fit. We observed the odds ratio in which Depth, Height, and Width of these dimensions of the furniture were considered. According to the overall analysis, we observed that the dimensions of the furniture affect the cost of the furniture to be more than 1000 Saudi Riyals. probability of price to be 1000 for Width is higher that means the furniture higher width are more likely to be having cost more than 1000 Riyals meanwhile as the height increases the probability of price over 1000 riyals increases. In the case of depth, the probability of price over 1000 riyals becomes constant after some threshold value. Thus we conclude that the furniture dimensions do influence the cost of fun=rniture to be more than 1000 Riyals.
Since the length, width and height of an item are related to the price, we can consider introducing volume variable which is the product of the three as a variable into the logistic regression model in future work.